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Abstract 

Background: A challenge in gene expression studies is the reliable identification of differentially expressed genes. 
In many high-throughput studies, genes are accepted as differentially expressed only if they satisfy sinnultaneously a 
p value criterion and a fold change criterion. A statistical nnethod, TREAT, has been developed for microarray data 
to assess fornnally if fold changes are significantly higher than a predefined threshold. We have recently applied the 
NanoString digital platform to study expression of mouse odorant receptor genes, which form with 1,200 members 
the largest gene family in the mouse genome. Our objectives are, on these data, to decrease false discoveries when 
formally assessing the genes relative to a fold change threshold, and to provide a guided selection in the choice of 
this threshold. 

Results: Statistical tests have been developed for microarray data to identify genes that are differentially expressed 
relative to a fold change threshold. Here we report that another approach, which we refer to as tTREAT, is more 
appropriate for our NanoString data, where false discoveries lead to costly and time-consuming follow-up experiments. 
Methods that we refer to as tTREAT2 and the running fold change model improve the performance of the statistical 
tests by protecting or selecting the fold change threshold more objectively. We show the benefits on simulated 
and real data. 

Conclusions: Gene-wise statistical analyses of gene expression data, for which the significance relative to a fold 
change threshold is important, give reproducible and reliable results on NanoString data of mouse odorant 
receptor genes. Because it can be difficult to set in advance a fold change threshold that is meaningful for the 
available data, we developed methods that enable a better choice (thus reducing false discoveries and/or missed 
genes) or avoid this choice altogether. This set of tools may be useful for the analysis of other types of gene 
expression data. 
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Background 

Multiplex gene expression studies that are based on 
microarrays and next-generation sequencing result 
in the generation of large datasets. The simulta- 
neous analysis of the expression of thousands of 
genes in these high-throughput approaches challenges 
statistical methods and data interpretation. Traditional 
statistical tests and analysis tools are therefore often 
insufficient. 
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One of the most popular types of biological experi- 
ments is a two-sample comparison. Gene expression 
studies often seek to identify genes that are Differentially 
Expressed (DE) between RNA samples from two types 
of biological conditions, such as gene knockout mice 
compared to wild-type mice. DE genes can give insights 
into biological mechanisms or pathways, and form the 
basis for further experiments. The traditional statistical 
method for identifying DE genes between two samples is 
the student s t-test. But a microarray comparative experi- 
ment is faced with simultaneously assessing a large num- 
ber of genes based on a small number of biological or 
technical replicates. Assessing such a large dataset with 
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a small sample size challenges statistical methods: deal- 
ing with many genes but few replicates may lead to large 
Fold Changes (FC) driven by outliers, and to small error 
variances [1,2]. In order to overcome such problems, the 
t-test has been modified for microarray data analysis. 
The general idea behind these modifications is to obtain 
more stable estimates of the error variance of a given 
gene by "borrowing" information from all available genes. 
This goal is often obtained by applying (empirical) 
Bayesian methods. Examples of these modified tests for 
microarray data are the SAM test [3], the regularized 
t-test [3,4], and the B-statistic [1,5], as reviewed [2,6,7]. 

Other statistical approaches for microarray data ana- 
lysis have introduced linear models [2,8-10]. These models 
allow for more flexibility, for instance when comparing 
more than two samples or introducing additional sources 
of variation. Such an example is Lin et al. [11], where a 
complex experimental design and research goal are ad- 
dressed by setting various contrasts in the design of the 
linear model. The bioconductor package limma, devel- 
oped by Smyth [12], applies a gene-wise linear model, and 
allows for the analysis of complex experiments (compar- 
ing many RNA samples), as well as more simple replicated 
experiments with two RNA samples. The package has fur- 
ther developed the ideas of Lonsstedt and Speed [1], 
which have been reset within the framework of linear 
models in order to address the challenges of microarray 
data analysis. The statistic (applied in limma) to identify 
DE genes is referred to as the moderated t-statistic, 
denoted by t* [5,12]. In the calculation of t*, shrinkage of 
the estimated error variances towards a pooled estimate is 
obtained through an empirical Bayes approach. 

For certain biological problems, it is important to rank 
genes according to their FC or to impose on a gene to 
attain a predefined FC threshold before calling it DE. In 
such situations there remains often a disconnect be- 
tween the concepts of a statistically significant differen- 
tial expression (based on the p value) and a biologically 
meaningful differential expression (FC higher than a 
predefined threshold value). Even the modified tests 
may result in genes with small FCs to be considered 
statistically significant. 

In order to integrate these statistical and biological 
concepts, a gene can be defined as DE when it satisfies a 
p value and a FC criterion [9,13-15]. The advantage of 
this combined approach is that the t-test or any of the 
modified tests can be combined with an ad-hoc FC cri- 
terion. The disadvantage of an ad-hoc FC criterion is 
that it does not take into account error variance and 
therefore offers no statistical confidence about future 
results. In addition, depending on the choice of the FC 
criterion and significance criterion, various interpreta- 
tions of the same dataset are possible [16]. The moder- 
ated t-statistic [5] has been extended by McCarthy and 



Smyth into a new "test relative to a FC threshold", 
abbreviated TREAT [17]. This method assesses formally 
whether the true differential expression is greater than a 
predefined FC criterion. TREAT offers greater specificity 
and reproducibility in identif)^ing DE genes, compared to 
the combined approach of statistical test and ad-hoc FC cri- 
terion. TREAT has been also added to the limma package. 

We have previously applied the NanoString digital 
platform [18] to study the expression of odorant recep- 
tor (OR) genes in mice [19,20]. In contrast to microarray 
data, where analog levels of fluorescent intensity are 
measured, NanoString data represent digital readouts of 
single molecules in the form of probe counts. These 
probes contain unique fluorescent bar codes, and RNA 
abundance of up to 800 genes can be analyzed in a 
single reaction in a single tube. The NanoString technol- 
ogy thus places itself between qPCR and microarrays in 
terms of throughput level [21]. We have analyzed with 
NanoString the expression of half of the OR gene reper- 
toire of -1200 genes [19]. Here, we have explored statis- 
tical tools for our NanoString data, and developed a 
systematic approach for identifying DE genes with re- 
spect to a given FC threshold. 

We explored the moderated t-statistic (t*), the deriv- 
ation of which was driven by microarray data (high 
throughput), versus the classical t-statistic on compara- 
tive NanoString experiments (medium throughput). We 
found that t* does not show a protective effect (i.e. fewer 
false discoveries) over t on our NanoString data. But we 
also wanted to test whether differential expression is 
greater than a FC threshold. Therefore we used the ana- 
lysis relative to a FC threshold together with the classical 
t-statistic in two approaches. The first approach is similar 
to TREAT as published for t* [17], and we refer to it as 
tTREAT. Then we addressed the arbitrary choice of the 
FC criterion itself, and developed a two-stage approach, 
tTREAT2. We describe the performance of TREAT, 
tTREAT, and tTREAT2 on our NanoString data, both 
in data simulation experiments and on biological data. 

The variability of the FC of a gene is inversely related 
to the expression level of that gene; lowly expressed 
genes tend to have a greater error in their measured FC 
levels [4,22]. These lowly expressed genes can thus more 
easily reach a certain FC threshold, and the inverse is 
true for highly expressed genes. A non-linear model, the 
Limit Fold Change model (LFC), has been developed to 
identify DE genes based on this relation [22]. We have 
applied a similar LFC model on our NanoString data. 
We did not use the LFC model as a tool for identifying 
DE genes but to set appropriate FC thresholds for genes 
with various ranges of expression levels, in order to 
avoid a subjective choice of FC criterion. The FC 
thresholds that we thus derived were then used in a 
subsequent analysis relative to a threshold in order to 
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identify the DE genes. We refer to this setup as the 
running FC model, and illustrate its use and benefit on 
biological data. 

Results 

Biological data of odorant receptor gene expression 

The main olfactory epithelium is located in the nasal 
cavity of the mouse, and detects volatile chemicals 
(odorants) in the inhaled air. The sense of smell must 
detect chemical stimuli with an immense variety in 
physicochemical properties. To accommodate this broad 
recognition, the mammalian olfactory system has evolved 
a large repertoire of molecular receptors, odorant receptors. 
These receptors are expressed by olfactory sensory neurons 
in the main olfactory epithelium. In the mouse, -1200 
odorant receptors are encoded by distinct genes, which 
form the largest gene family in its genome. It is widely 
accepted that a mature olfactory sensory neuron ex- 
presses only one of the -1200 odorant receptor genes. 
The molecular and genetic mechanisms that regulate this 
one receptor - one neuron rule remain unclear, and are 
the focus of our research. We have demonstrated the role 
of the H element [23] and the P element [24] in the regu- 
lation of expression of clusters of odorant receptor genes 
by genetically engineering mouse strains that lack the H 
element [25] or the P element [19]: these are the AH and 
AP strains and the AHxAP double knockout strain [19]. 
Another mouse strain is the A01fr7A strain [19]; by means 
of chromosome engineering [26], we excised a 2.4 mega- 
base region on Chromosome 9 that contains the Olfr7 
cluster consisting of 99 OR genes [27]. We have also char- 
acterized temporal expression patterns of 531 odorant 
receptor genes in adult and aged mice [20]. 

Here we used datasets [19] from a NanoString analysis 
of 558 OR genes comparing knockout versus wild- type 
mouse strains. Specifically, we used NanoString data 
obtained from six mutant mice of the AHxAP strain 
(cartridge MK29) compared to 12 control (wild- type) 
mice (cartridges MK29 and MK37, six mice each); these 
18 mice are in a mixed genetic background, C57BL/6 J x 
129/SvEv. Another dataset was obtained from six mice 
of the A01fr7A strain, compared to six control (wild-type) 
mice (cartridge MK38); these 12 mice are in pure genetic 
background, 129/SvEv. With NanoString CodeSet Gorilla, 
we determined the RNA abundance for 558 OR genes 
from 1 [ig RNA of whole olfactory mucosa tissue sam- 
ples. Each lane of a NanoString cartridge represents a 
different RNA sample and mouse. Thus, there are 6-12 
biological replicates per biological condition, and no 
technical replicates. 

Approaches relative to a FC threshold 

Our novel method tTREAT is similar to TREAT [17]. It 
is applied to the regular student s t-statistic, and requires 



a predefined FC threshold r. For the two-stage design, 
tTREAT2, a second threshold ft with ^>t, is used in a 
first "stop and go" stage in which it is decided whether a 
gene is non-DE (stop) or whether it can proceed to the 
next stage (go). Then, for all the "go" genes, a tTREAT 
test with FC threshold r is applied. In the running FC 
model, a non-linear model for FC versus average gene 
expression is first used to determine various FC thresh- 
olds Ti for a number of ranges of expression levels. 
Genes are then binned in k gene expression levels, and 
the appropriate r/ is used per concentration bin in a 
subsequent analysis relative to a FC threshold. 

Simulated data 

Because NanoString data are not yet as widely studied as 
microarray data, we have conducted a data simulation 
procedure that represents one of our typical two-group 
comparison NanoString experiments. 

The procedure for simulating one NanoString dataset 
was conducted according to the following steps and 
distributional assumptions: 

• To get a general idea about the variances of 
NanoString gene expression data, the genes in the 
AHxAP dataset were used as an example gene 
population. Biological data from AHxAP mice 
were chosen, as they represent a noisier dataset 
due to the mixed genetic background of this strain 
[19]: the resulting simulated data will not 
represent the cleanest example. The AHxAP 
dataset was used only for the next step of the 
simulation exercise. 

• Subsequently, 100 real variances across genes, (T^, 
were drawn from an inverse-gamma distribution: 
Inv - Gammaigi,g2). The parameters and ^2 are 
estimated by a Maximum Likelihood procedure in 
the above described gene population. 

• Each of the 100 gave rise to three randomly 
drawn real differences ^g. The and three 
corresponding were included in one of the 
following three gene groups: 

(Group 1) DE genes: The were drawn from a 
Gaussian ~A/'(0, (XgV start) ^tnd had to 
satisfy the criterion: \^g\ > log2{o)). 

(Group 2) Non-DE genes, noisy: Similarly, ^g 
drawn from ~A/'(0, cfgVstart) but with 
\/Sg\ < log2{o)). 

(Group 3) Non-DE genes, regular: Obviously for 
this group the real ^g were set to 0. 

Note that empirically, the normal distribution 

seems acceptable for the ^g, based on our 

NanoString data. 

• The ^g that were produced in the previous step 
served as true differences that were then 
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subsequently used to simulate three possible 
estimates from a Gaussian -Af(jS^^ ^<g\/%)' 
Similarly residual variances ^were drawn from a 
Chi-square distribution: - ^ (^<?) 

Since the quantity o) used to initiate the simulation as 
described above defines the DE genes by the rule > 
log2{co), the distributional center of the of DE genes 
actually lies a little further than the actual value of co, 
depending on the value of their variance: Og^JZg, Hence- 
forth, we will refer to 6) as the PC (xi with respect to 
which the data was simulated. 

Every such simulated dataset was thus initially based 
on 100 cr^ leading to a total of 900 genes. The regular 
non-DE genes (group 3) were fixed as 20% of the 900 
genes. However, to assess whether the number of DE 
genes in a dataset influences the results, the percentage 
of DE genes (and thus also the percentage of the noisy 
non-DE genes) was varied between 1% and 40% (noisy 
non-DE genes: 79% and 40%, respectively). 

On the simulated data, the following statistical ap- 
proaches for identifying genes that are DE with regard to 
a certain PC threshold were assessed: 

(1) TREAT with regard to PC threshold r 

(2) tTREAT with regard to PC threshold r 

(3) tTREAT2 with a bilateral p value calculation in the 
second stage with regard to PC thresholds Q 
(stage 1) and r (stage 2) 

As the two-stage design tTREAT2 relies on the choice 
of two related PC thresholds, one for each of the two 
stages, it is not straightforward to compare the perform- 
ance of tTREAT2 to TREAT and/or tTREAT. Therefore 
we have derived various experimental schemes to show 
in which situations the use of a two-stage design like 
tTREAT2 can be beneficial. 

Results on simulated data 

We simulated 400 realizations of a two-group compari- 
son NanoString experiment with 900 genes for which 
the DE genes are simulated with respect to a certain PC 
difference 6) as described above. As such we know 
beforehand to which group (DE or non-DE) each gene 
in these 400 datasets belongs. We can therefore assess 
the performance of the procedures described in the 
methods section. Por this purpose, we fix Vstan = 8 and 
the significance (type I error) was set at 0.05 or at 0.01. 
The mean of the false discoveries (false positives) and 
missed genes (false negatives) over the 400 generated 
datasets is plotted for the various statistical approaches 
for PCs Ct) (the PC difference with respect to which the 
data are simulated) and r (the actual PC threshold used 



in the testing procedure). We also report mean Area 
Under the ROC-curve (AUC) values over the 400 gener- 
ated data sets. 

Pigure 1 and Additional file 1 compare TREAT and 
tTREAT on a simulation exercise where the PC differ- 
ence (xi and the PC threshold r were chosen to be equal. 
We tested values for a; = r of 1.3 (Pigure lAi and IA2), 
1.5 (Pigure IBi and IB2) and 2 (Pigure ICi and IC2). 
Across the various percentages of DE genes (indicated 
on the X axis), the percentages next to the gray and black 
bars of tTREAT in Pigure 1 represent the percentual de- 
crease (prefixed with a minus sign) or increase (prefixed 
with a plus sign) in false discoveries or missed genes 
with respect to the reference, TREAT. We find that 
tTREAT always results in fewer false discoveries than 
TREAT, at p = 0.05 and p = 0.01. Moreover, when p = 0.05 
is applied as a significance cut-off (Pigure lAi^ IBi and 
ICi), tTREAT decreases the false discoveries for only a 
few more missed genes compared to TREAT (Additional 
file 2). Por p = 0.01 (Pigure IA2, IB2 and IC2) the latter is 
true for datasets with small percentages of DE genes but 
for datasets with -10% of DE genes, the decrease in false 
discoveries is nullified by the increase in missed genes 
(Additional file 2). In terms of AUC performance there is 
no difference between TREAT and tTREAT across the 
percentages of DE genes (Table 1). 

In order to illustrate the added value of the safety 
margin around the PC threshold r that has been chosen 
for statistical analysis, two simulation experiments were 
done. In the first experiment, the data were simulated 
such that the actual DE genes had a PC difference a; of 
at least 1.5 (up or down), and p was set at 0.01. A blind, 
stringent tTREAT test at the higher PC threshold r of 
2.5 (Pigure 2A) will result in a high number of missed 
genes (dark blue bars, Pigure 2B). The reference test for 
this scenario is a tTREAT with PC threshold r = a; = 1.5, 
represented by the black and gray stacked bars in Pigure 2B. 
(See Additional file 3A for stacked bars across 40 per- 
centages of DE genes). The stringent tTREAT (r of 2.5) 
decreases, as expected, the false discoveries, by 90% in 
comparison to the reference test (cyan vs gray bars, 
Pigure 2B) but at the cost of a 50% increase in missed 
genes (dark blue vs black bars, Pigure 2B). When a 
two-stage approach is used with PCs Q and r chosen at 
2.5 and 1.5, the high number of missed genes of the 
blind, stringent tTREAT test (with PC threshold r of 
2.5) is reduced (red vs dark blue bars, Pigure 2B). At 
the same time, the two-stage approach has a protective 
effect over the reference tTREAT test (PC r = 1.5) by 
decreasing the number of false positives (orange vs grey 
bars, Pigure 2B). The stringent test has a much lower 
AUC than the reference test and two-stage approach 
(Table 1). The two-stage approach has the best overall 
performance (highest AUC value) for this experiment. 
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(See figure on previous page.) 

Figure 1 False discoveries and missed genes for TREAT and tTREAT on simulated data, at p = 0.05 or at p = 0.01. (Ai) Tlie positive y axis 
sliows tine average of tlie false discoveries, and tine negative y axis sliows tine average of tine missed genes for TREAT and tTREAT on 400 
simulated datasets. The x axis shows the percentage of DE genes (1%, 10% and 20%) that is simulated in each case. The data are simulated with 
respect to a EC difference oj of 1.3 (up or down), and the EC threshold r used for TREAT and tTREAT is also 1.3. The percentages next to the gray 
and black bars of tTREAT represent the percentual decrease (prefixed with a minus sign) or increase (prefixed with a plus sign) in false discoveries 
or missed genes with respect to the reference, TREAT (depicted in cyan and blue). The significance cut-off was set at p = 0.05. (A2) Same conditions as 
in panel Ai, except that significance was set at p = 0.01 . (Bi) Similar to panel Ai, but now the data are simulated with respect to a EC difference co of 
1 .5, and the EC threshold r used for TREAT and tTREAT is also 1 .5. The significance cut-off was set at p = 0.05. (B2) Same conditions as in panel Bi, 
except that significance was set at p = 0.01 . (Ci) Similar to panel Ai, but now the data are simulated with respect to a EC difference 00 of 2, and the EC 
threshold r used for TREAT and tTREAT is also 2. The significance cut-off was set at p = 0.05. (C2) Same conditions as in panel Ci, except that 
significance was set at p = 0.01 . 



A second simulation experiment focuses on the in- 
verse situation. Data are simulated with respect to a FC 
of 0) = 2.5 (up or down), and p = 0.01. One tTREAT test 
aims too low with a FC threshold of 1.5 (purple dashed 
lines in Figure 3A), leading to many false discoveries. 
The reference test for this scenario is a tTREAT with FC 
threshold t = o) = 2.5, represented by the black and gray 
stacked bars in Figure 3B. (See Additional file 3B for 
stacked bars across 40 different percentages of DE genes). 
A two-stage approach decreases the number of false 
positives that are obtained by the non-stringent test 
with too low a FC threshold (orange vs light blue bars. 
Figure 3B) while maintaining the false negatives (red vs 
dark blue bars. Figure 3B). The AUC values show that 
the overall performance of these three tests is very 
similar (Table 1). 

Benefit of tTREAT2 for a biological dataset 

We analyzed samples of six A01fr7A mutant mice in 
comparison to six control mice with our NanoString 
Gorilla CodeSet containing 558 OR genes. The Olfr7 
cluster contains 99 OR genes, and for 40 OR genes we 
could design specific probes. Of these 40 genes, 37 were 
defined as informative, as the median of the normalized 
NanoString counts in the control mice is at least 100. 
Because these 37 OR genes are not present in the gen- 
ome of the mutant mice, the counts for these OR genes 
in mutant mice should be at background levels, and 
these genes are a priori DE. The A01fr7A strain is thus 
the ultimate negative control for probe specificity. Be- 
cause the maximum counts for the negative controls 
is -50, we reasoned that even for expressed Olfr? cluster 
genes with low normalized counts (-130), we could 
expect a > 2-fold change. We thus set the FC threshold r 
at 2 in a first tTREAT analysis, with p = 0.01. The MC 
plot in Figure 4A shows the results: of the 37 genes in 
the deleted Olfr? cluster, 36 are detected as having a FC 
significantly lower than Vi, but one deleted gene, Olfr916, 
is missed. We lowered the threshold to r = 1.3 in a 
second tTREAT analysis: a total of 39 genes is now 
identified with a FC significantly lower than 1/1.3 or 
higher than 1.3 (Figure 4B). But two genes with a FC 



higher than 1.3 turn out to reside outside the deleted 
cluster: Olfr98S, which could perhaps make sense given 
it resides close to the deleted region, and Olfr447, which 
is on chromosome 6 and ought not to be affected. Im- 
portantly, when we apply a two-stage approach, tTREAT2 
with 6 = 2 and r = 1.3, the 37 genes of the deleted cluster, 
and only these genes, are found to be DE (Figure 4C): no 
genes are missed, and no genes outside the cluster are 
identified as DE. The experimentally rare case of a dele- 
tion mutant, in which certain genes are not present, 
thus enables us to demonstrate the benefit of tTREAT2. 



The running FC model 

We applied the running FC model on our NanoString 
data obtained from six AHxAP mutant mice and 12 
control mice. As these mice are from a mixed genetic 
background, we used a less-stringent tTREAT with r = 1.5, 
and p = 0.01. The results are shown as an MA plot in 
Figure 5A and as an MC plot in Figure 5C. We find that 
nine genes from the P cluster and 12 genes from the H 
cluster have a FC significantly lower than 1/1.5. But two 
additional genes, Olfr362 on Chromosome 2 and Olfr 107 
on Chromosome 17, are identified as DE by tTREAT with 
r = 1.5. When we apply a running FC model, a FC thresh- 
old value r is not chosen. Instead the model calculates a 
different r for ten gene expression levels. These values are 
>1.5 for the lower expression levels and drop to 1.36 for 
the very high expression levels (Figure 5B). After calculat- 
ing the r values, we applied them together with a tTREAT 
test in our running FC model. This model identified 23 DE 
genes, 22 of which were also identified by tTREAT with 
r = 1.5 (Figure 5D). The lowly expressed gene Olfr 107 (on 
Chromosome 17) is no longer identified as DE by the 
running FC model, but the model identified an additional 
OR gene, Olfr695, which is plausible as it resides within 
the P cluster. 

The advantage of tests relative to a FC threshold over 
combined approaches (statistical test outcome + FC cri- 
terion) is illustrated in Figure 5A. By combining the 
regular t-test and a FC criterion of 1.5, a total of 49 
genes are identified as DE. Of these, 20 reside outside 
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Table 1 Area under the ROC curve (AUC) for various 
methods on simulated data 

AUC results completing Figure 1 
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the H and P clusters (pink circles in Figure 5A), and are 
thus in reality most likely not DE. 

Discussion 

Gene expression has been studied extensively for a wide 
variety of biological problems. Over the years various 
techniques have been developed for and applied to gene 



expression analysis. Ordered from low to high throughput, 
these techniques include qPCR, NanoString, microarrays, 
and RNA sequencing. These techniques come with their 
own specifications, advantages and disadvantages [21,28] 
but each also with specific challenges for statistical ana- 
lysis. Microarrays have thus far been the most widely used 
approach to study gene expression, and have therefore led 
to numerous analytical tools and statistical tests. Many of 
these tools can also be applied to NanoString data or 
qPCR data, and some have resulted in practical software 
packages for NanoString [29,30] . 

The moderated t (referred to as t*) based TREAT, 
which was developed for microarray data [5,17] works 
well on our NanoString data. But we found that 
tTREAT with regular t is even more appropriate. 
Moreover, we propose an alternative approach that 
defines a safety margin around the FC threshold. This 
tTREAT2 application in two subsequent stages can be 
beneficial when data are expected to be noisy around 
a chosen FC threshold. We also developed a tech- 
nique for finding FC thresholds that vary with the ex- 
pression level of genes, followed by applying these 
objectively set thresholds in tTREAT. In principle, this 
running FC model can be applied to gene expression data 
obtained with any technique. 

The goal of this study is not to determine which 
test has the highest performance on NanoString data, 
although we describe the overall performance (AUC 
values) of TREAT, tTREAT and tTREAT2 on simu- 
lated data. Statistical tests like TREAT, fine-tuned to 
gene expression data, are difficult to outperform. In 
certain situations, however, an alternative test may 
prove more beneficial for the technology that is used 
and the problem that is studied. In our studies of 
odorant receptor gene expression, a falsely discovered 
gene leads to extensive follow-up experiments such as 
in situ hybridization and even generation of a new 
mouse strain by gene targeting. Such follow-up exper- 
iments are costly, lengthy, and time-consuming. We 
thus need to minimize false discoveries. When using 
NanoString data, we find that tTREAT protects more 
against false discoveries at significance cut-offs of p = 
0.05 and p = 0.01. It is perhaps counterintuitive that 
tTREAT instead of TREAT is more beneficial for our 
dataset, as TREAT and the moderated t statistic are 
known for their ability to decrease false discoveries in 
microarray experiments. 

There are four main messages. First, we demonstrate 
with data simulations that tTREAT is more appropriate 
for our NanoString data than TREAT. We used bio- 
logical data to initiate data simulation experiments, and 
find that tTREAT results in fewer false discoveries com- 
pared to TREAT when using significance cut-offs of 
p = 0.05 or p = 0.01. But when the percentage of DE 
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Figure 2 The two-stage design in a stringent test situation. (A) Data simulation experiment: empirical density functions of the DE genes (solid 
curve), noisy non-DE genes (dashed curve), and non-DE genes (dotted curve). The vertical olive lines indicate the simulated FC difference oj between test 
and control groups. The choice of w determines the peak of the density function of the DE genes, which lies a little further. The non-DE genes are 
distributed around a FC difference of 0. A stringent test relative to a FC threshold r, as indicated by the purple dashed lines, identifies only the DE genes 
to the right or left of these lines. (B) The positive y axis shows the average number of false discoveries, and the negative y axis shows the average 
number of missed genes, over 400 generated datasets for three tests relative to a FC threshold. The x axis shows the percentage of DE genes (1%, 10% 
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genes in the simulated data is high, the advantage of 
tTREAT over TREAT in terms of false discoveries is 
countered by tTREAT missing more DE genes. When 
the overall performance of TREAT and tTREAT is com- 
pared by AUCs (thus independently of the chosen p 
value), there is no difference on our simulated data. The 



TREAT Bayesian approach shrinks (or blasts) the indi- 
vidual gene sample variances towards a pooled estimate, 
which works particularly well when the number of repli- 
cates (arrays) is small; often no more than two replicates 
are available for microarrays [5]. The NanoString 
counts in our dataset are not noisy and the 
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throughput is much lower (558 genes) compared to 
a microarray experiment. So some of the gene-wise 
variances may be shrunk (or blasted) by too large a 



factor when applying TREAT on our NanoString 
data, as the hierarchical Bayesian model is expecting 
a few outlier variances. 
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Figure 4 The two-stage design on biological data: A0[fr7A 
mice. (A) MC plot of 558 OR genes. Filled squares represent genes 
that are identified as DE by tTREAT with an FC threshold r of 2, 
comparing A0lfr7A mutant mice to control (wild-type) mice. The black 
stippled rectangle encompasses the 37 genes in the OlfrJ cluster 
deletion; these genes do not exist in the genome of A0lfr7A mice, and 
are thus obviously DE. The black arrow indicates a gene {Olfr916) in the 
O/frZ cluster deletion that was missed, thus that was identified as 
non-DE. Empty circles represent non-DE genes. (B) MC plot of 558 
OR genes. Filled squares represent genes that were identified as DE by 
tTREAT with a FC threshold r of 1.3. The black stippled rectangle 
encompasses the 37 genes in the Olfr7 cluster deletion. The black 
arrows indicate two OR genes that were identified as DE, but reside 
outside of the Olfr7 cluster deletion, and therefore could be false 
discoveries. Because Olfr985 is close to the genomic deletion, its 
expression could be affected. Empty circles represent non-DE 
genes. (C) MC plot of 558 OR genes. Filled squares represent genes 
that identified as DE by tTREAT2 with FC thresholds 0 = 2 and r = 1 .3. 
The black stippled rectangle encompasses the 37 genes in the Olfr7 
cluster deletion. Empty circles represent non-DE genes. tTREAT2 now 
identifies the 37 genes in the deleted cluster as DE, and no other gene: 
there are no missed genes, and no false discoveries. 



Second, when noisy data are expected, with many 
non-DE genes showing signs of being increased or de- 
creased, it is beneficial to use a safety margin around the 
chosen FC threshold in an analysis relative to a FC 
threshold. A good choice seems to set ^ to r -i- 0.5 or r -i- 1. 

Third, the particular choice of a FC threshold can 
influence the interpretation of biological data [16]. 
The running FC model can be applied in order to 
avoid the subjective choice of a FC threshold, or 
when data vary strongly with expression levels. We 
used the same dataset to determine the FC thresholds 
for various gene expression levels and to apply 
tTREAT, which may seem suboptimal. It would be ideal 
to determine FC thresholds on an independent dataset be- 
fore using these thresholds in the subsequent analysis of 
the actual data. But often obtaining these independent 
data may be too expensive or impossible. We believe that 
performing the running FC model on the same dataset 
gives accurate results despite its data-driven nature in 
such circumstances. 

Fourth, the choice of analytical tools must be driven 
by the research goals, the gene expression technique, 
and the hypothesis that is being tested. TREAT, 
tTREAT, or tTREAT2 either give a high number of 
false discoveries and few missed genes, only a few 
false discoveries for a higher number of missed genes, 
or an average but similar number of false discoveries 
and missed genes. The inevitable trade-off between false 
discoveries and missed genes must be evaluated 
dependent on the experimental context. In our case - 
NanoString studies of odorant receptor gene expression - 
we want to minimize the false discoveries, because the 
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Figure 5 Running fold change model on biological data: AHxAP mice. (A) MA plot of 558 OR genes. Filled green squares represent genes 
that were identified as DE by tTREAT with a FC threshold of 1.5, comparing AHxAP double mutant mice to control (wild-type) mice. Empty circles 
represent non-DE genes. The empty pink circles indicate genes that satisfy the combined criteria of a significant p value for a regular t-test (p < 0.01) 
and a FC value > 1.5 or < 1/1 .5. The black vertical stippled line indicates the maximum of the negative controls. The red horizontal stippled line is the 
mean of the M values. The yellow stippled lines are equal to M values that represent 1 .3 fold up or down: ± logjO .3). (B) MA plot of 558 OR genes. 
Filled green squares represent genes that were identified as DE by a running FC model based on tTREAT. Empty circles represent non-DE genes. The 
grey vertical stippled lines delineate ranges of gene expression, for which various FC thresholds have been used. Orange numbers indicate the FC 
thresholds that are applied in the subsequent tTREAT analysis. (C) MC plot corresponding to the results shown in panel A. M values are arranged 
according to the relative gene order along the chromosomes, which are indicated in various colors. The black arrows indicate two OR genes that have 
been identified as DE, but reside outside of the H and P clusters; these are most likely false discoveries. (D) MC plot corresponding to the results shown 
in panel C. The solid black arrow indicates an OR gene that has been identified as DE, but resides outside of the H and P clusters, and is most likely a 
false discovery. The dashed black arrow indicates an OR gene that is identified as DE and resides within the P cluster; this gene is missed in tTREAT. 



validation and follow-up experiments are laborious and 
expensive. But for other projects, the inverse may be desir- 
able: the number of missed genes needs to be minimized, 
for instance in large-scale screening of molecules against a 



drug target, where missing a potential blockbuster drug 
cannot be afforded. 

In the running FC model, the choice of the FC per- 
centile in each bin in the development of the LFC model 
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(step 1) as well as the choice of the k bins when applying 
the model (in step 3) affect the final results of the run- 
ning FC model, thus the DE genes discovered. Depend- 
ing on the percentage of DE genes that is expected, 
percentile values ranging from P55 (many DE genes are 
expected) to P95 (a few DE genes are expected) are 
deemed acceptable. Regarding the number of k bins, a 
good coverage of the full expression range of the genes 
under analysis should be ensured. 

Performance results on a single simulated dataset 
(instead of 400) would have been subject to chance 
(in favor or disfavor of one method). Repeating the sim- 
ulations 400 times permits an understanding of the vari- 
ability in test performance that may be obtained between 
different simulations. The variability of our mean results 
is assessed by the 95% prediction intervals around the 
mean number of false discoveries (missed genes re- 
spectively) on these 400 data simulations. Neither of 
the statistical tests shows outliers in performance that 
could contradict our conclusions. Our data simulation 
procedure requires a "real" data input at the beginning 
only to get an idea about the distribution of gene error 
variances for NanoString. This input came from our 
data as described in the results. We also used an inde- 
pendent, published NanoString data set with -250 mouse 
genes (cell-cycle and G2/M related functions) [31] to 
initiate the set of data simulation experiments again. 
We find that simulations initiated by these data lead to 
very similar results (data not shown). 

We present approaches for testing the significance of 
gene expression relative to a FC threshold, with a focus 
on our NanoString data on OR genes. When carrying 
out these hundreds of significance tests simultaneously, 
the issue of multiple testing can be raised. A significance 
analysis for microarrays (SAM) test has been developed 
[3] that encompasses both a test statistic for ranking 
(a slightly tuned version of students t) and a way to 
control the False Discovery Rate and thus to adjust for 
multiple testing. The advantage of our approaches is that 
they provide p values that can be adjusted by a multiple 
testing method of choice, ranging from Bonferroni to 
Benjamini and Hochberg [32]. There is a wide variety of 
adjustment procedures for multiple testing. The choice 
of the most appropriate procedure for a given analysis is 
not straightforward. Guidelines and criteria to be attrib- 
uted to the multiple testing approaches have been 
described [2]. In general, the formulation of a composite 
null-hypothesis forces TREAT, tTREAT, and tTREAT2 
to rely on a more conservative way of p value calculation 
[17]. The distribution of the resulting p values is thus 
skewed towards the larger values, which has conse- 
quences: (1) if a method for adjusting multiple testing 
problems relies on the uniformity of the distribution of 
the p values, it cannot be used as this criterion is not 



met and (2) we feel that when applied to NanoString 
data (for which the technical maximum is 800 genes), 
these already conservative p values should not be ad- 
justed by yet another conservative method. 

For the development of TREAT and initially also for t* 
[5,17], it was decided to use gene-wise linear models 
with arbitrary coefficients and contrasts of interest (mak- 
ing them widely applicable) as a contextual and prac- 
tical background; see also the limma package [12]. The 
tTREAT, tTREAT2, and the running FC model are also 
based on gene-wise linear models. Others [8] have used 
linear models, or, more precisely, analysis of variance 
models (ANOVA) to assess DE genes in microarray ex- 
periments. The main difference is that all genes are in- 
cluded in a single linear model, presenting the advantage 
that the normalization process is combined with the data 
analysis and thus pre-normalization of the data is not 
necessary. Obviously for such an ANOVA model to be 
justified, its underlying model assumptions should be met, 
but they can be assessed quite straightforward. For a 
smaller NanoString CodeSet Century [19], which consists 
of 89 OR genes and 11 reference genes, we developed 
ANOVA models that also dealt with assessing the DE 
genes relative to a FC threshold. The negative controls 
could not be included in these ANOVA models, as their 
residuals were making the general residual distribution 
heavily tailed and therefore non-Gaussian. Equality of 
variances was not met but within an acceptable borderline 
range on these genes. The results of these ANOVA models 
were not satisfactory, as the genes that were identified 
as DE genes (with FCs significantly higher than a certain 
predefined threshold) were sometimes very different 
from the TREAT, tTREAT and tTREAT2 results, and 
less biologically meaningful. ANOVA models including 
various genes in the same linear model can be applied 
on NanoString data and may be very useful for certain 
purposes. On our data, however, the gene-wise linear 
models and thus TREAT, tTREAT, or tTREAT2 were 
preferred over an approach that models all genes 
simultaneously. 

Conclusions 

Our gene-wise statistical analysis of gene expression data 
with significance relative to a FC threshold gives repro- 
ducible and reliable results on NanoString data of odor- 
ant receptor genes, the largest gene family in the mouse 
genome. Because it is difficult to set a biologically mean- 
ingful FC threshold in advance, we have developed 
methods that provide guidance in determining a stable 
FC threshold (in order to minimize false discoveries and/ 
or missed genes) or in avoiding this choice altogether. 
By applying a two-stage approach, a safety margin around 
the FC threshold can be set, which is beneficial in cer- 
tain situations. Our running FC model identifies FC 
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thresholds in an automated way, and is thus a more ob- 
jective model leading to results that are more reprodu- 
cible. This model is well suited for the problem of 
higher FC variances of lower expression values, thus 
avoiding a bias. 

Methods 

NanoStrIng data 

A NanoString experiment is performed using one or 
more NanoString cartridges, each containing 12 lanes. 
In each lane one RNA sample is assayed. The cartridge 
is scanned and imaged, resulting in digital readings (raw 
counts) for every barcode (gene) in every lane [18]. A 
macro can then be used to collect raw counts of car- 
tridges of interest into an xls file. We imported these 
collected xls files into the R environment, version 2.14 
[33]. Counts were processed to eliminate systematic ex- 
perimental variability, differing amounts of input RNA, 
and variability in background, in three steps: first a 
normalization with respect to the geometric mean of the 
positive control spike counts, then a normalization with 
respect to the geometric mean of a group of five refer- 
ence genes, and finally a background correction, which 
consists of subtracting the mean + 2''SD (standard devi- 
ation) of the negative control counts. Count values <1 
were fixed to background level. We define a gene as 
informative if the median of the normalized NanoString 
counts in wild-type control mice is >100. 

The FC for gene g can be estimated for any two sam- 
ples as the ratio of the geometric mean of the normal- 
ized counts of first sample (^Xg^^ over the geometric 

mean of the second samples counts : FC = 

As ratios are naturally heavily skewed, log2{FC) values, 
referred to as M values, are represented in most figures 
and analyses instead of the FCs. In MA plots, M values 

are plotted against the quantity A, ^/o^2 ^\/%^ • ^ ' 

which represents an average of the NanoString counts 
for that gene. In MC plots, M values are plotted accord- 
ing to relative gene order on the chromosomes. 

Usage of the limma package 

A linear model is fitted to the NanoString expression data 
(digital counts) for each gene, which is key in the limma 
package [5,12]. We use the log2 of digital NanoString 
counts as expression data for every lane for the limma ap- 
plication. Thus for a set of 1 lanes (assumed independent), 

there is a log2 counts vector = (^y^^ , • • • , y^j^ for every 

gene g. For the linear model on any gene g, we write: 
E[yJ = Xi//g. The experimental design (i.e. which biological 
replicates belong to which RNA sample) is captured by 



the design matrix X, and i/Zg represents the unknown vec- 
tor of coefficients. What we are interested in, are 
certain contrasts, as given by the contrasts matrix C (for 
example C^= (1, - 1) captures a two RNA group com- 
parison): /3g= C^il/g. By fitting the linear model, the ij/g 
will be estimated which then easily leads to the esti- 
mates of the ^g. Linear model theory also shows that 

Cov l^i^^j = (r^^g , so the variance-covariance of the 

contrasts is deduced as follows: Covj^^J = o^C^WgC, 

for ease of notation let s put C^V^C = Z^. Therefore, the 
students t-statistic for the contrasts in question is then typ- 
ically written as follows: tg = where Sg represents the 
variance estimator of the unknown gene- wise variance. 

tTREAT 

Let r be the FC threshold as predefined and Mg the 
log2{FC) of a gene g. If we take the example of a two 
RNA group comparison, Mg =yg^-yg^y i-e. the differ- 
ence in arithmetic means of the log2 NanoString counts 
of the two groups. It can easily be seen how this com- 
parison now comes down to a contrast when applying a 
linear model to gene g. 

The idea of an analysis relative to a FC threshold is to 
test the following composite hypothesis: Hq : \Mg\ < log2 
(r) against Hi : \Mg\ > log2{T), Thus, Hq is an interval of 
values for Mg rather than a single value, which is nor- 
mally 0. McCarthy and Smyth [17] determined the exact 
(and conservative) p value for this Hq on t* (called 
TREAT). We reformulate their ideas for t instead of 
as follows: Let ^^fe be the observed value of the t-statistic 
for a certain gene g, the p value now equals: p = P{\T\ > 
tobs I ^o}« As Hq is an interval, we may find an upper 
bound of this p value by choosing the element of Hq that 
is the most difficult to reject (hence conservative). So, 
let Mobs be the observed value of Mg and choose 
Mq = min(/o^2(^)7 l^o^sl) • Then the actual p value is 
bounded above by p<P{\T\> tots I = Mq}, Let S= ^ , 

then if tobs^O, p < 2 [P{T > tots - S\ M^^Mq}] and if 
tobs <0, p<2''[P{T< tabs ^^\Mg = Mq}]. In order to dif- 
ferentiate our test from t* and TREAT [17], we refer to 
it as tTREAT. 

tTREAT2 

In this two-stage design, we reconsider the relative to 
a FC threshold idea at the point where the composite 
null hypothesis is formulated: Hq : \Mg\ < log2{T) against 
Hi : \Mg\ > log2{T). The interval to which Hq applies, 
is thus Hq: - log2{T) <Mg< log2{T), A larger interval, 
called I, is now defined around Hq, for example I: 
[-^og2{6), log2{6)] with 6> T, Then, in a first stage. 
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called the stop or go stage, a 100% x (1 - astagei) 
confidence interval (CI) for Mg is defined as usual: 

[^g-^df^-^^^-sei/^ - Sg^.Mg + t^f^^_asta,ey^. ^ , where 

df stands for the residual degrees of freedom. If this 
100% X (1 - a^tagei) CI for Mg falls strictly within I, it 
is decided to call the gene g as non-DE (stop), other- 
wise the gene g goes on to the second stage (go). For 
all the go genes, the second stage p value calculation 
is done exactly as for the tTREAT part above with 
regard to the interval defined by Hq, This approach 
allows for more flexibility, particularly when there is 
concern that the chosen tTREAT PC threshold may be 
too high. In such a case, 6 could be set to a high 
value, and a lower value, e.g. 6-1 or ^-0.1, could be 
chosen for r. In a second stage, the type I errors of 
both stages, a^tagei (defining the CI) and a^tagei (sig- 
nificance of tTREAT analysis on go genes), can be 
chosen freely and may differ from each other. Here we 
have chosen agtagei to equal 0.05. The value of 6 (and 
thus the length of the interval I) and its difference 
with regard to r allow for a safety margin around r, 
thus helping to reduce false positives in noisy data 
while keeping the false negatives fixed or reducing 
them as well. 

Running FC model 

By using a fixed FC threshold, genes with lower expression 
levels have a greater chance of being identified as DE 
because of their higher variance. In order to determine 
which FC threshold is appropriate for a certain expression 
level (or range of expression levels), we developed a 
running FC model. We used a model similar to [22] to 
find a potential FC cut-off function. Then we applied this 
function is to a number of ranges r of different expression 
levels, resulting in r different FC thresholds to use in any 
type of subsequently applied analysis. A step-by-step de- 
scription of the running FC model is as follows. 

1. Discrete relationship between FC and average gene 
expression levels: 

Any two-by-two comparison of RNA groups is consid- 
ered separately in the model. For simplicity, assume 
there are only two RNA groups as above. For the run- 

~ 2 

ning FC model, the actual FC, FCg =^iis calculated for 

every gene g. When FCg is < 1, the reciprocal is taken. 
Then the Mg = log2{FCg) values are plotted against the 
average expression levels (ARg) of the reference group 
(say group 1, in our case these are the control mice), 
ARg = Xg^, Subsequently, the overall AR range is divided 
into bins of different widths but containing an equal num- 
ber of genes. In each bin j (/ = 1, • • • , m), the b%-percentile 
of the FCg for all genes in that bin is determined and named 
bpj. All these bpj can be visualized on the M-AR plot. 



2. Fitting a continuous function: 

As in [22], the equation bp = a -\- b/AR (where AR 
stands for the median of the ARg values in a certain bin 
j) seems to fit the data quite well. A least-squares fit of 
this equation is thus fitted to model the FC b-percentiles 
over various gene expression levels. 

3. Defining the appropriate FC thresholds for various 
gene expression levels: 

The model fit of step 2 is used to get actual FC thresh- 
old values for a number k of gene expression level 
ranges. In this case the bins of gene expression ranges 
are typically distributed more evenly over the full AR 
range, in contrast to the binning in step 1. For each of 
the i different ranges (/ = 1, • • • , /c) we apply the FCi, as 
defined by the model fit of step 2, to all genes in that 
specific range i of expression values as a FC threshold 
value for any kind of analysis (TREAT, tTREAT, or 
tTREAT2). The complete analysis is then referred to as 
the running FC model. 

Calculation of the Area Under the ROC curve (AUC) 

The R open-source package pROC was used to calculate 
AUC values of statistical tests [34]. 

Software for the tools 

A zip file of a folder containing the functions in R code 
[33] to use the analytic tools developed here (tTREAT, 
tTREAT2, running FC model, and MA and MC plots) 
is available as Additional file 4. The folder also contains 
an R script as an example of how to apply these func- 
tions. The ReadMe file illustrates and explains how to 
use the tools. 

Availability of supporting data 

The biological datasets (MK29, MK37 and MK38) support- 
ing the results of this article are available in the NCBI Gene 
Expression Omnibus [35], and are accessible through GEO 
Series accession number GSE53876: http://www.ncbi.nlm. 
nih.gov/geo/query/acc.cgi?acc=GSE53876. 

Additional files 



Additional file 1: False discoveries and missed genes for TREAT and 
tTREAT on simulated data, at p = 0.01. (A) The positive y axis shows 
the average of the false discoveries, and the negative y axis shows the 
average of the missed genes for TREAT and tTREAT on 400 simulated 
datasets. The x axis shows 40 different percentages of DE genes (ranging 
from 1% to 40%) that is simulated in each case. The data are simulated 
with respect to a FC difference oj of 1.3 (up or down), and the FC 
threshold r used for TREAT and tTREAT is also 1.3. The black percentages 
next to the gray and black bars of tTREAT represent the percentual 
decrease (prefixed with a minus sign) or increase (prefixed with a plus 
sign) in false discoveries or missed genes with repect to the reference, 
TREAT (depicted in cyan and blue). Significance is set at p = 0.01. (B) 
Similar to panel A but now the data are simulated with respect to a FC 
difference oj of 1.5, and the FC threshold t used for TREAT and tTREAT is 
also 1.5. (C) Similar to panel A but now the data are simulated with 
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respect to a FC difference oj of 2, ar^d the FC threshold t used for TREAT 
ar^d tTREAT is also 2. 

Additional file 2: Mean false discoveries and missed genes, 
together with 95% prediction intervals on simulated data. 

Additional file 3: Benefit of using the two-stage design in a 
stringent/non-stringent test situation. (A) The positive y axis shows 
the average number of false discoveries, and the negative y axis shows 
the average number of missed genes, over 400 generated datasets for 
three tests relative to a FC threshold. The x axis shows 40 different 
percentages of DE genes (ranging from 1% to 40%) that is simulated in 
each case. Significance is set at p = 0.01. The DE genes are simulated with 
respect to a FC difference oj of 1.5. Here, the reference test is the original 
tTREAT with a FC threshold r of 1.5, thus a test with r = oj (in gray/black). 
The stringent test with a FC threshold t of 2.5 is in cyan and blue. The 
tTREAT2 is in orange and red. (B) The positive y axis shows the average 
number of false discoveries, and the negative y axis shows the average 
number of missed genes, over 400 generated datasets for three tests rela- 
tive to a FC threshold. The x axis shows 40 different percentages of DE 
genes (ranging from 1% to 40%) that is simulated in each case. 
Significance is set at p = 0.01. The DE genes are simulated with respect to 
a FC difference oj of 2.5. Here, the reference test is the original tTREAT 
with a FC threshold t of 2.5, thus a test with r = oj (in gray and black). 
The non-stringent test with a FC threshold r of 1.5 is in cyan and blue. 
The tTREAT2 with FC thresholds 2.5/1.5 is in orange and red. 

Additional file 4: Software. A zip file of a folder containing the 
functions in R code to use the analytic tools developed here (tTREAT, 
tTREAT2, running FC model, and MA and MC plots). The folder also 
contains an R script as an example of how to apply these functions. The 
ReadMe file illustrates and explains how to use the tools. 
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